knitr::opts_chunk$set(echo = TRUE)
The objective of this workflow example is to combine a series of Data Science skills. We will be going through all the stages of a typical Social Science project from scraping data, to constructing databases, to visualizing and interpreting that data.
For simplicity, I divide this workflow into sections, where I demonstrate a key skill with a specific purpose. Of course, in the real world such steps are interchangeable and can be done in many different ways!
For this project, we are going to gather information about research universities in the US, which are ranked as R1 (Very High Research Activity) and R2 (High Research Activity). We will gather information about these from different sources, save it into tables and upload them to a relational database. From our database, we can then use SQL commands to conduct fast, simple analysis.
Without further ado, let us run the command to create a database. Note that if you’re working in R Markdown, this will create the database in the directory where your .Rmd file is saved.
# Here, we'll need the DBI, and the RSQLite package
library(RSQLite)
library(DBI)
# creating the database with
demo_db <- DBI::dbConnect(RSQLite::SQLite(), "demo_db.db")
# checking that the database created exists
file.exists("demo_db.db")
## [1] TRUE
The above code snippet should return “TRUE” when you run it on your machine. It will have created the database, which is now empty, on your hard-drive. We’ll be scraping/downloading the data now, and adding it to our database using a primary key as we go.
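Note that dbWriteTable() on its own does not declare a primary key, so if you want the database itself to enforce one, you can create the table schema up front and append rows to it afterwards. Below is a minimal sketch only, assuming a hypothetical table named uni_demo with the university name as its key.
# illustrative only: create an empty table whose Name column is the primary key;
# rows could then be added with dbWriteTable(..., append = TRUE)
DBI::dbExecute(demo_db, "
  CREATE TABLE IF NOT EXISTS uni_demo (
    Name TEXT PRIMARY KEY,
    Status TEXT,
    City TEXT,
    State TEXT
  )
")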
Here, I’ll be writing a function that scrapes this page automatically. Note that when scraping pages like this, we’re always susceptible to changes in the layout or content of the web page. This is particularly the case for Wikipedia, where users can freely edit at any time and revert any changes. For this reason, we aim to scrape once and save our data immediately!
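One way to guard against the page changing underneath you is to save a local snapshot of the HTML the first time you fetch it. A minimal sketch, assuming you are happy writing the snapshot into your working directory (the file name is arbitrary):
# save a one-off snapshot of the page so the scrape can be reproduced offline
library(xml2)
page_snapshot <- read_html("https://en.wikipedia.org/wiki/List_of_research_universities_in_the_United_States")
write_html(page_snapshot, "r1r2_wikipedia_snapshot.html")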
We’ll be collecting each institution’s name, its public or private status, its city and state, and the URL of its dedicated Wikipedia page.
Upon close inspection, the Wikipedia page is made up of 3 large tables: R1, R2, and R3 level universities. But we only care about R1 and R2 institutions here. I make a function that takes no arguments because I’ll only be able to use it for this Wikipedia page, due to its unique layout, but of course you could make one with generic arguments that you keep in your repository and re-use!
Here’s the scraper:
# here we'll need the tidyverse package to make a tibble
# and the xml2 package to process the selectors of the different parts of the website
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.1.8
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (http://conflicted.r-lib.org/) to force all conflicts to become errors
library(xml2)
library(rvest)
##
## Attaching package: 'rvest'
##
## The following object is masked from 'package:readr':
##
## guess_encoding
uni_scraper <- function() {
# Specify the URL of the Wikipedia page
uni_url <- "https://en.wikipedia.org/wiki/List_of_research_universities_in_the_United_States"
# Read the HTML content from the specified URL
page <- read_html(uni_url)
# Extract all tables containing information about research universities
universities_tables <- html_elements(page, css = "table.wikitable")
# Initialize empty vectors to store the data
names <- c()
status <- c()
city <- c()
state <- c()
urls <- c()
# Loop through each table (but limit to tables 1-2 since the others are not R1 or R2 research centers)
for (table_index in seq_along(universities_tables[1:2])) {
universities_table <- universities_tables[[table_index]]
# Extract the rows from the table
rows <- html_elements(universities_table, "tr")
# Loop through each row starting from the second row (skipping header)
for (i in 2:length(rows)) {
row <- rows[i]
# Extract columns from the row
columns <- html_elements(row, "td")
# Extract data and append to the respective vectors
names <- c(names, html_text(columns[1]))
status <- c(status, html_text(columns[2]))
city <- c(city, html_text(columns[3]))
state <- c(state, html_text(columns[4]))
# Extract the URL from the first column
url_element <- html_element(row, "a") %>% html_attr("href")
full_url <- ifelse(is.na(url_element), NA, paste("https://en.wikipedia.org", url_element, sep = ""))
urls <- c(urls, full_url)
}
}
# Create a data frame with the collected data
universities_data <- tibble(
Name = names,
Status = status,
City = city,
State = state,
URL = urls)
return(universities_data)
}
Now that we’ve made the function, we’ll need to call it so that it actually executes. Note that it is good practice to keep this function separate from its call in an R Markdown file, so that when you knit you avoid re-scraping all the data.
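A simple caching pattern makes this concrete. The sketch below assumes an arbitrary cache file name and only re-scrapes when no saved copy exists; here, though, we’ll just call the function directly.
# illustrative caching pattern: scrape once, then re-use the saved result
if (!file.exists("R1R2_table_cache.rds")) {
  R1R2_table <- uni_scraper()
  saveRDS(R1R2_table, "R1R2_table_cache.rds")
} else {
  R1R2_table <- readRDS("R1R2_table_cache.rds")
}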
# Call the scraping function
R1R2_table <- uni_scraper()
head(R1R2_table)
## # A tibble: 6 × 5
## Name Status City State URL
## <chr> <chr> <chr> <chr> <chr>
## 1 Arizona State University Public Tempe "AZ\n" https://en…
## 2 Auburn University Public Auburn "AL\n" https://en…
## 3 Baylor University Private (non-profit) Waco "TX\n" https://en…
## 4 Binghamton University Public Vestal "NY\n" https://en…
## 5 Boston College Private (non-profit) Chestnut Hill "MA\n" https://en…
## 6 Boston University Private (non-profit) Boston "MA\n" https://en…
And that’s it! We’ve saved all this information in a table.
Now let’s navigate to the individual university pages, following the hyperlinks in the tables, and gather each university’s geographic coordinates, endowment, and total number of students.
Of course, we could have done this in the same function, but for modularity purposes I do it in two functions. (And as you can see, this second one is already long enough on its own!)
scrape_university_extra <- function() {
# Read the HTML content from the specified URL
url <- "https://en.wikipedia.org/wiki/List_of_research_universities_in_the_United_States"
page <- read_html(url)
# Extract all tables containing information about universities
universities_tables <- html_elements(page, css = "table.wikitable")
# Initialize empty vectors to store the data
names <- c()
total_students <- c()
endowment <- c()
coordinates <- c()
# Loop through the first two tables
for (table_index in seq_along(universities_tables[1:2])) {
universities_table <- universities_tables[[table_index]]
# Extract the rows from the table
rows <- html_elements(universities_table, "tr")
# Loop through each row starting from the second row (skipping header)
for (i in 2:length(rows)) {
row <- rows[i]
# Extract columns from the row
columns <- html_elements(row, "td")
# Extract university name and URL
name <- html_text(columns[1])
url_element <- html_element(columns[1], "a")
university_url <- if (!is.null(url_element)) url_element %>% html_attr("href") else NA
full_url <- ifelse(is.na(university_url), NA, paste("https://en.wikipedia.org", university_url, sep = ""))
# Navigate to the university's dedicated page
university_page <- read_html(full_url)
#scrape the coordinates (lat and long. simultaneously)
coordinates_value <- university_page %>%
html_element(css =".geo-dms") %>%
html_text()
#scrape the endowment value but keep only the USD value
endowment_value <- university_page %>%
html_element(css = "table.infobox th:contains('Endowment') + td") %>%
html_text() %>%
gsub("\\s*[\\(\\[].*", "", .)
#scrape total number of students, but remove unnecessary information (different campuses etc.)
total_students_value <- university_page %>%
html_element(css= "table.infobox th:contains('Students') + td") %>%
html_text() %>%
str_replace_all("\\[[^\\]]*\\]|\\([^\\)]*\\)|[a-zA-Z]", " ")
total_students_value <- gsub(",", "", total_students_value)
if(!is.na(total_students_value) && any(str_detect(total_students_value, "\\s+"))){
individual_numbers <- as.numeric(strsplit(total_students_value, "\\s+")[[1]])
total_students_value <- sum(individual_numbers)
}else{
total_students_value <- as.numeric(total_students_value)
}
#Append the data to the vectors
names <- c(names, name)
coordinates <- c(coordinates, coordinates_value)
endowment <- c(endowment, endowment_value)
total_students <- c(total_students, total_students_value)
}
}
# Create a tibble with the collected data using tidyverse
universities_data <- tibble(
Name = names,
Coordinates = coordinates,
Endowment = endowment,
TotalStudents = total_students)
return(universities_data)
}
Now let’s call our function (this might take a while given all the pages we have to go to), and again save the results into a table. That way, we can merge them later.
# Call the scraping function
university_info_extra <- scrape_university_extra()
# I use head again here to avoid repeating the entire table
head(university_info_extra)
## # A tibble: 6 × 4
## Name Coordinates Endowment TotalStudents
## <chr> <chr> <chr> <dbl>
## 1 Arizona State University 33°25′15″N 111°56′02″W $1.47 billion 142029
## 2 Auburn University 32°36′11″N 85°29′10″W $1.05 billion 33015
## 3 Baylor University 31°32′53″N 97°06′58″W $1.97 billion 20626
## 4 Binghamton University 42°05′20″N 75°58′01″W $ 148.1 million 18148
## 5 Boston College 42°20′06″N 71°10′13″W $3.3 billion 15106
## 6 Boston University 42°20′56″N 71°06′01″W $3.2 billion 36729
You might get, like I did, a message saying NAs introduced by coercion. In this case, that’s okay. This is because Wikipedia pages aren’t identical for all the universities, and some information might be in different places. In a Data Science project, you might want to manually identify these and collect the data you need. But in this case, we’ll just be leaving the NAs in our final table.
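If you want to see how much is actually missing before deciding what to do about it, a quick per-column count works well. A minimal sketch using the table we just scraped:
# count missing values in each column of the scraped table
university_info_extra %>%
  summarise(across(everything(), ~ sum(is.na(.))))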
Now that we have our two tables, we can merge them with a left-join command.
# merge the two table outputs from the two functions into one
merged_uni_table <- left_join(university_info_extra, R1R2_table, by = "Name")
head(merged_uni_table)
## # A tibble: 6 × 8
## Name Coordinates Endow…¹ Total…² Status City State URL
## <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 Arizona State University 33°25′15″N … $1.47 … 142029 Public Tempe "AZ\… http…
## 2 Auburn University 32°36′11″N … $1.05 … 33015 Public Aubu… "AL\… http…
## 3 Baylor University 31°32′53″N … $1.97 … 20626 Priva… Waco "TX\… http…
## 4 Binghamton University 42°05′20″N … $ 148.… 18148 Public Vest… "NY\… http…
## 5 Boston College 42°20′06″N … $3.3 b… 15106 Priva… Ches… "MA\… http…
## 6 Boston University 42°20′56″N … $3.2 b… 36729 Priva… Bost… "MA\… http…
## # … with abbreviated variable names ¹Endowment, ²TotalStudents
Now things are going to get a little more complicated. In this repository, you’ll find a file called ivyleague.csv. It contains information about Ivy League universities. As you will have noticed, the Ivy League institutions are already included in our larger table, but the csv file holds some information that Wikipedia cannot give us.
We will have to combine it with our main table so that, for each Ivy League university, the table records its Ivy status, its EIN (Employer Identification Number), and its county.
Let’s first read the data, and then use a regular expression to match the partial names in the csv file with those of the wider table.
ivys <- read.csv("ivyleague.csv")
# if a university matches the name partially in the large table and it's private, then allocate
# "yes" to ivy status in a new column of the large table
# I manually exclude Teachers College at Columbia University for simplicity
merged_uni_table$IsIvy <- ifelse(
grepl(paste0(".*?(", paste(ivys$uni_name, collapse = "|"), ").*"),
merged_uni_table$Name, ignore.case = TRUE) &
!grepl("Teachers College at Columbia University", merged_uni_table$Name, ignore.case = TRUE) &
merged_uni_table$Status == "Private (non-profit)",
"Yes",
"No")
#now adding the full names to the ivys table so that they can be merged and add information
#about the counties and the ein
ivys$Name <- sapply(ivys$uni_name, function(uni) {
matching_names <- merged_uni_table$Name[grepl(uni, merged_uni_table$Name, ignore.case = TRUE) &
merged_uni_table$Status == "Private (non-profit)"]
if (length(matching_names) > 0) {
return(matching_names[1])
} else {
return(NA)
}
})
Now that we’ve done some data manipulation, we can merge the two tables and clean up the variable names. This way, when we upload everything into our relational database, it will all match nicely. Make sure to concatenate the county and the state; it will make life easier later!
merged_uni_table <- left_join(merged_uni_table, ivys[, c("Name", "ein", "county")], by = c("Name" = "Name"))
#rename columns
merged_uni_table <- merged_uni_table %>%
rename(EIN_ivys = ein, County_ivys = county)
# Drop unnecessary columns from ivys data frame
ivys <- ivys[, !(names(ivys) %in% c("ein", "county"))]
#concatenating the county and state for every row of merged_uni_table; the county won't show if the info isn't available
#this will allow the table to be easily updated if county information is provided about any other university
merged_uni_table <- merged_uni_table %>%
mutate(county_state = ifelse(
is.na(County_ivys),
State,
paste(County_ivys, State, sep = ", ")
))%>%
select(-County_ivys)
We have a nice table with all the information about ivy-league universities and R1, R2 universities. Let’s write it to the database!
#writing table to database
dbWriteTable(demo_db, "R1R2_uni_list", merged_uni_table)
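To double-check that the write succeeded, you can list the tables in the database and preview a few rows; a quick sketch:
# confirm the table exists and peek at the first rows
dbListTables(demo_db)
dbGetQuery(demo_db, "SELECT Name, Status, State FROM R1R2_uni_list LIMIT 3")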
We’ll now be looking at a different aspect of universities that was not available on the Wikipedia page (hopefully you can start to see how this is an issue you’d encounter daily in a Data Science job!). We’ll be collecting the university rankings for all the Ivy League institutions. We’ll use the ARWU (Academic Ranking of World Universities) page for this.
We’ll make a function to collect the ranking of each university for the years 2003, 2013, and 2023. If you have a quick browse on the website, you’ll notice that some rankings are provided as a range, e.g. 77-100. In these cases, we’ll take the midpoint and record that as the ranking.
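As a quick illustration of the midpoint rule (the same logic appears in the cleaning step after the scraper), a range like "77-100" becomes (77 + 100) / 2 = 88.5, which we later round to a whole number:
# midpoint of a ranking range given as a string
mean(as.numeric(strsplit("77-100", "-")[[1]]))  # 88.5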
# here we'll need the RSelenium package
library(RSelenium)
scrapeIvyRankings <- function() {
list_ivy <- c("Harvard University", "Princeton University", "Yale University",
"Columbia University", "University of Pennsylvania", "Brown University",
"Dartmouth College", "Cornell University")
Ivy_rankings <- data.frame(
University = character(),
Ranking = character()
)
rank_url <- "https://www.shanghairanking.com/"
# Start the Selenium server:
rD <- rsDriver(browser=c("firefox"), verbose = F, port = netstat::free_port(random = TRUE), chromever = NULL)
driver <- rD[["client"]] # note this alternative but equivalent call for setting the driver client
# Navigate to the selected URL address
driver$navigate(rank_url)
# This step only needs to happen once, as we can then simply navigate to the desired year
ranking_page <- driver$findElement(using = "xpath", value = '//*[@id="arwu"]/div[1]/button')
ranking_page$clickElement()
for (i in list_ivy) { #for all the ivys
for (year in c(2003, 2013, 2023)) { #for all the years
# select the scrollable element of year at the top of the page
date_selector <- driver$findElement(using = 'class name', value = 'inputWrapper')
Sys.sleep(1)
date_selector$clickElement()
Sys.sleep(1)
# select year of interest in sequence, using modular xpath (2024-year of interest)
date_element_xpath <- paste0('//*[@id="bar-content"]/div[1]/div/div[2]/ul/li[', (2024 - year), ']')
date_element <- driver$findElement(using = "xpath", value = date_element_xpath)
date_element$clickElement()
Sys.sleep(1)
# Select the search bar for universities, clear it and type each university one by one
search_bar <- driver$findElement(using = "class", value = "search-input")
search_bar$clearElement()
search_bar$sendKeysToElement(list(i))
Sys.sleep(1)
search_bar$sendKeysToElement(list(key = "enter"))
Sys.sleep(2)
# Scrape the ranking
select_ranking <- driver$findElement(using = 'xpath', value = '//*[@id="content-box"]/div[2]/table/tbody/tr/td[1]/div')
ranking <- as.character(select_ranking$getElementText())
# Store the result in Ivy_rankings
Ivy_rankings <- do.call(rbind, list(Ivy_rankings,
data.frame(University = i,
Year = year,
Ranking = ranking)))
}
}
# closing the driver
driver$close()
rD$server$stop()
return(Ivy_rankings)
}
Now let’s call the function, do some cleaning of the rankings and write it to our database!
# Call the function to get Ivy League rankings
Ivy_rankings <- scrapeIvyRankings()
# clean the results before outputting the final table
# to turn the rankings to numbers and find midpoint for range measures
# I round the ranking to 0 decimal places, since half points in ranking are not substantively significant
ARWU_ivy_ranking <- Ivy_rankings %>%
mutate(
Ranking = ifelse(grepl("-", Ranking), gsub("-", " ", Ranking), Ranking),
Ranking = sapply(strsplit(Ranking, " "), function(x) if(length(x) > 1) sum(as.numeric(x)) / 2 else as.numeric(x)),
Ranking = round(Ranking,0)
)
# writing table to database
dbWriteTable(demo_db, "ARWU_ivy_ranking", ARWU_ivy_ranking)
Now that we’ve gathered the universities’ overall rankings, let’s gather the subject-specific rankings. Since this is a social science project, we’ll collect the 2023 social science rankings for each Ivy. As a difficulty in this task, we have to deal with the fact that not all universities offer all social science subjects. For this reason, we’ll scrape the whole social sciences table, and the missing subjects will simply be excluded from the result.
social_science_rank <- function() {
# Input your list of universities here
list_ivy <- c("Harvard University", "Princeton University", "Yale University",
"Columbia University", "University of Pennsylvania", "Brown University",
"Dartmouth College", "Cornell University")
# Initialize the result table
result_table <- data.frame()
# Start the RSelenium driver
rD <- rsDriver(browser = c("firefox"), verbose = FALSE, port = netstat::free_port(random = TRUE), chromever = NULL)
driver <- rD[["client"]]
# Navigate to the selected URL address
driver$navigate("https://www.shanghairanking.com/")
for (i in list_ivy) {
# Navigate to the university page
university_page <- driver$findElement(using = "xpath", value = '//*[@id="__layout"]/div/div[1]/div[1]/div/div[2]/ul/li[3]/a')
university_page$clickElement()
Sys.sleep(0.5)
# Search for the search bar, clear it and type the university name
large_search_bar <- driver$findElement(using = "class", value = 'input')
large_search_bar$clickElement()
Sys.sleep(1)
large_search_bar$clearElement()
large_search_bar$sendKeysToElement(list(i))
Sys.sleep(1)
large_search_bar$sendKeysToElement(list(key = "enter"))
Sys.sleep(1)
# Navigate to the social sciences page
select_uni_page <- driver$findElement(using = "class", value = 'univ-main')
select_uni_page$clickElement()
Sys.sleep(1)
social_science_select <- driver$findElement(using = "class", value = "inputWrapper")
social_science_select$clickElement()
Sys.sleep(0.5)
social_science_click <- driver$findElement(using = "xpath", value = '//*[@id="gras"]/div[2]/div[1]/div[1]/div[2]/div/div[2]/ul/li[last()]')
social_science_click$clickElement()
Sys.sleep(0.5)
# Extract rankings table
social_sciences_rankings <- driver$findElement(using = 'class name', value = "table-container")
rankings_html <- read_html(social_sciences_rankings$getElementAttribute('innerHTML')[[1]])
# Transform the HTML table into a data frame
rankings_table <- html_table(rankings_html)[[1]]
# Add University column
rankings_table$University <- i
# Append to the result table
result_table <- rbind(result_table, rankings_table)
}
# Return the result table and close driver
driver$close()
rD$server$stop()
return(result_table)
}
Now we can call the function, as usual, and save it to our database after some cleaning. Again, here we have some ranks that are ranges, so we adjust these.
# Call to function
social_rank_table <- social_science_rank()
# If there's a range provided I take the mean ranking
# I round to 0 decimal places because "12.5" ranking is not substantively meaningful compared to 13 or 12
# Especially since the main interest of ranking is to compare universities with each other
clean_social_science <- social_rank_table %>%
mutate(
Rank = ifelse(grepl("-", Rank), gsub("-", " ", Rank), Rank),
Rank = sapply(strsplit(Rank, " "), function(x) if(length(x) > 1) sum(as.numeric(x)) / 2 else as.numeric(x)),
Rank = round(Rank, 0)
)
dbWriteTable(demo_db, "Social_Science_Ivy_Ranking", clean_social_science)
We’ve scraped websites and written web crawlers. Now we turn to APIs to gather data (where an API is available, it is generally preferable to scraping).
We’ll first work with the ProPublica API, using its Organization Method to obtain, for each Ivy League university, the total revenue and total assets for the years 2011-2021. When reading the API documentation, it becomes clear we have to use the EIN of each Ivy university to retrieve the data. Below is the function which accesses the API.
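But first, a quick sanity check can help: test the endpoint with a single request before looping over all the EINs. A sketch, with a placeholder EIN to be filled in from ivyleague.csv:
# illustrative single request against the Organization endpoint
test_ein <- "REPLACE_WITH_AN_EIN"
test_url <- paste0("https://projects.propublica.org/nonprofits/api/v2/organizations/", test_ein, ".json")
httr::status_code(httr::GET(test_url))  # 200 means the organization was found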
# We'll need the httr package
library(httr)
get_ivy_info <- function() {
# Create an empty list to store individual data frames
results_list <- list()
# filter the merged table to Ivy League universities and pull their EINs
eins_list <- merged_uni_table %>%
filter(IsIvy == "Yes") %>%
pull("EIN_ivys") # Replace "EIN_ivys" with the actual column name in your dataset
# Loop through Ivy League universities and fetch information
for (ein in eins_list) {
api_url <- paste0('https://projects.propublica.org/nonprofits/api/v2/organizations/', ein, '.json')
# Make the GET request
response <- GET(api_url)
# Check if the request was successful (status code 200)
if (status_code(response) != 200) {
cat("Error:", status_code(response), "\n")
cat("--------------------------------------------------\n")
next() # Skip to the next iteration if there's an error
}
data <- content(response, "parsed")
# Extract relevant data
ein_data <- data.frame(
ein = data$organization$ein,
year = sapply(data$filings_with_data, function(x) x$tax_prd_yr),
revenue = sapply(data$filings_with_data, function(x) x$totrevenue),
assets = sapply(data$filings_with_data, function(x) x$totassetsend),
stringsAsFactors = FALSE
)
results_list <- bind_rows(results_list, ein_data)
}
return(results_list)
}
ivy_res <- get_ivy_info()
We’ve now used the method outlined in the API documentation to gather our information and format it into a table inside the function. That leaves running the function and cleaning the output table (and writing it to the database, of course).
# Cleaning the API output to be just the columns of interest
ivy_res_cleaned <- ivy_res %>%
mutate(ein = as.character(ein)) %>% #make sure variables are the same type to make merging easier
left_join(
merged_uni_table %>%
mutate(EIN_ivys = as.character(EIN_ivys)) %>%
select(EIN_ivys, Name),
by = c("ein" = "EIN_ivys")
) %>%
select(Name, ein, year, revenue, assets)
# writing table to database
dbWriteTable(demo_db, "ivy_fiscal_info", ivy_res_cleaned)
Another common way to interact with APIs is through packaged APIs. This is the case for the tidycensus package in R, which conveniently allows us to interact with US census data; see the tidycensus documentation for details. Note that to interact with it, you’ll need to request an API key and save it, as you would with any other API key, in an .env file.
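For reference, the .env file read below is just a plain-text file with a single NAME=value line; the key name has to match what Sys.getenv() looks up. A sketch of its contents, with a placeholder instead of a real key:
# contents of tidycensusapi.env (placeholder value)
tidycensus_key=YOUR_CENSUS_API_KEY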
I follow the documentation closely to retrieve the names of all the counties in the US and their estimated median household income for both 2015 and 2020 (based on the American Community Survey (ACS)).
library(tidycensus)
# Read the key from my envs folder
readRenviron("tidycensusapi.env")
census_api_key <- Sys.getenv("tidycensus_key")
# choosing the county geography and the year 2015
# variable B19013_001 for the median household income adjusted for inflation
housing_income2015 <- get_acs(geography = "county",
variables = c(medincome = "B19013_001"),
year = 2015)
housing_income2020 <- get_acs(geography = "county",
variables = c(medincome = "B19013_001"),
year = 2020)
# I take ivy universities and merge them with the 2015 housing income table. For this, I have to match the first part of the county
housing_info2015 <- merged_uni_table %>%
filter(IsIvy == "Yes") %>%
mutate(county_state_match = str_extract(county_state, "^[^,]+, [A-Z]")) %>%
left_join(housing_income2015 %>%
select(NAME, estimate) %>%
mutate(county_state_match = str_extract(NAME, "^[^,]+, [A-Z]")),
by = "county_state_match") %>%
distinct(county_state_match, .keep_all = TRUE) %>%
select(Name, county_state, estimate) %>%
rename("2015" = estimate) %>%
mutate(county_state = str_replace(county_state, "\n", ""))
# Repeat the process but adding 2020 as well
# Then we can pivot it longer
housing_info2020 <- housing_info2015 %>%
mutate(county_state_match = str_extract(county_state, "^[^,]+, [A-Z]")) %>%
left_join(housing_income2020 %>%
select(NAME, estimate) %>%
mutate(county_state_match = str_extract(NAME, "^[^,]+, [A-Z]")),
by = "county_state_match") %>%
distinct(county_state_match, .keep_all = TRUE) %>%
rename("2020" = estimate) %>%
select(Name, county_state, "2015", "2020")
Now that we’ve gathered the data from the API, and dealt with the differing county names, we can pivot it longer to have tidy-long data. We’ll also write it to the database.
housing_info_long <- housing_info2020 %>%
pivot_longer(cols = c("2015", "2020"), names_to = "year", values_to = "estimate")
# Write in db
dbWriteTable(demo_db, "ivy_county_income", housing_info_long)
We’ve gathered a lot of information about universities and Ivies! Now is the time to make full use of our database and retrieve interesting combinations of information.
To do this, we’ll use SQL via its R interface (RSQLite) to pull information from across the tables into a single one. We’ll save that table and use it for our analysis; that way, we won’t have to filter through the raw tables later.
In the SQL query below, I join all the tables we made thus far into one and save it. This requires a series of JOIN commands. For each Ivy League university, I retrieve: its county, its endowment per student, its average overall ARWU ranking, its average social science ranking (Economics, Political Sciences, and Sociology), the average median household income of its county, and its average revenue per student.
# Connecting to the database
db <- dbConnect(RSQLite::SQLite(), "demo_db.db")
insights_table <- dbGetQuery(db, "WITH EndowmentCTE AS (
SELECT
R1R2_uni_list.Name AS University,
(CASE
WHEN R1R2_uni_list.Endowment LIKE '%$%' AND R1R2_uni_list.Endowment LIKE '%billion%' THEN
CAST(REPLACE(REPLACE(R1R2_uni_list.Endowment, '$', ''), ' billion', '') AS DECIMAL(20, 2)) * 1000000000
WHEN R1R2_uni_list.Endowment LIKE '%$%' AND R1R2_uni_list.Endowment LIKE '%million%' THEN
CAST(REPLACE(REPLACE(R1R2_uni_list.Endowment, '$', ''), ' million', '') AS DECIMAL(20, 2)) * 1000000
ELSE
CAST(REPLACE(R1R2_uni_list.Endowment, '$', '') AS DECIMAL(20, 2))
END / R1R2_uni_list.TotalStudents) AS EndowmentPerStudent,
a.AverageOverallRanking,
s.SocialScienceRank
FROM R1R2_uni_list
LEFT JOIN ARWU_Ivy_ranking ivy ON R1R2_uni_list.Name = ivy.University
LEFT JOIN (
SELECT
s1.University,
ROUND(AVG(r.Ranking), 2) AS AverageOverallRanking
FROM Social_Science_Ivy_Ranking s1
LEFT JOIN ARWU_ivy_ranking r ON s1.University = r.University
GROUP BY s1.University
) a ON R1R2_uni_list.Name = a.University
LEFT JOIN (
SELECT
s2.University,
ROUND(AVG(s2.Rank), 2) AS SocialScienceRank
FROM Social_Science_Ivy_Ranking s2
WHERE s2.Subject IN ('Economics', 'Political Sciences', 'Sociology')
GROUP BY s2.University
) s ON R1R2_uni_list.Name = s.University
WHERE ivy.University IS NOT NULL
GROUP BY s.University
),
CountyIncomeCTE AS (
SELECT
Name AS University,
county_state,
ROUND(AVG(estimate), 2) AS AverageCountyIncome
FROM ivy_county_income
GROUP BY Name, county_state
),
RevenuePerStudentCTE AS (
SELECT
i.Name,
ROUND(AVG(i.revenue / COALESCE(r.TotalStudents, 1)), 2) AS AverageRevenuePerStudent
FROM ivy_fiscal_info i
LEFT JOIN R1R2_uni_list r ON i.Name = r.Name
GROUP BY i.Name
)
SELECT
e.University,
c.county_state AS County,
e.EndowmentPerStudent,
e.AverageOverallRanking,
e.SocialScienceRank,
c.AverageCountyIncome,
r.AverageRevenuePerStudent
FROM EndowmentCTE e
LEFT JOIN CountyIncomeCTE c ON e.University = c.University
LEFT JOIN RevenuePerStudentCTE r ON e.University = r.Name;
")
Good news: we’ve finished with our data collection and cleaning. We can now do some fun plotting with ggplot! We’ll plot the following: (1) average university ranking vs. average Econ/PS/Soc ranking, (2) average university ranking vs. endowment per student, (3) endowment per student vs. average median household income, (4) average revenue per student vs. average median household income, and (5) average revenue per student vs. average university ranking.
ggplot is by far the easiest way to do this:
library(ggplot2)
# Plot 1: Average university ranking vs. average Econ/PS/Soc ranking
ggplot(insights_table, aes(x = AverageOverallRanking, y = SocialScienceRank)) +
geom_point(size = 4, color = "hotpink") +
labs(title = "Figure 1: Average University Ranking vs. Average Econ/PS/Soc Ranking",
x = "Average University Ranking",
y = "Average Econ/PS/Soc Ranking") +
geom_smooth(method='lm', col = "blue", linetype = "dotted", fill = "lightpink", alpha = 0.2) + #adding regression line
annotate("text", x = 200, y = 65, label = "Regression Line", color = "black", size = 3) + #labelling regression line
theme_minimal()
# Plot 2: Average university ranking vs. endowment per student
ggplot(insights_table, aes(x =AverageOverallRanking , y = EndowmentPerStudent)) +
geom_point(size = 5, color = "hotpink") +
labs(title = "Figure 2: Average University Ranking vs. Endowment Per Student (USD) ",
x = "Average University Ranking ",
y = "Endowment Per Student (USD)")+
geom_smooth(method='lm', se = FALSE, col = "blue", linetype = "dotted") +
scale_y_continuous(labels = scales::comma)+
annotate("text", x = 200, y = 400000, label = "Regression Line", color = "black", size = 3) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Plot 3: Endowment per student vs. average median household income
lab_pos <- data.frame(University =c("Harvard University", "Princeton University", "Yale University",
"Columbia University", "University of Pennsylvania", "Brown University",
"Dartmouth College", "Cornell University"),
x = c(2300000, 3800000, 3400000, 620000, 800000, 800000, 1300000, 500000),
y = c(92000, 83000, 72000, 84000, 47000, 53000, 65000, 59000))
ggplot(insights_table, aes(x = EndowmentPerStudent, y = AverageCountyIncome, size = EndowmentPerStudent)) +
geom_point(aes(color = University), alpha = 0.7, show.legend = FALSE) +
scale_size_continuous(range = c(3, 12)) + #dot size range
labs(title = "Figure 3: Endowment Per Student vs. Average Median Household Income",
x = "Endowment Per Student (USD)",
y = "Average Median Household Income (USD - Adjusted for Inflation)",
caption = "Bubble size = Endowment Per Student") +
scale_x_continuous(labels = scales::comma)+
scale_y_continuous(labels = scales::comma)+
geom_text(data = lab_pos, aes(x = x, y = y, label = University), size = 3)+
theme_minimal() +
theme(legend.position = "bottom", legend.title = element_blank())
# Plot 4: Average revenue per student vs. average median household income
# put data into long format so that it's easier to make levels for the data
insights_long <- insights_table %>%
mutate(University = factor(University)) %>%
pivot_longer(cols = c(AverageRevenuePerStudent, AverageCountyIncome),
names_to = "Variable", values_to = "Values")
# grouped bar plot
ggplot(data = insights_long, aes(x = University, y = Values, fill = Variable)) +
geom_bar(stat = "identity", position = position_dodge(width = 0.9), width = 0.7) +
labs(title = "Figure 4: Average Revenue per Student vs. Average Median Household Income",
x = NULL, y = "USD") +
scale_y_continuous(labels = scales::comma, limits = c(0, 550000)) +
scale_x_discrete(labels = NULL) +
theme_minimal() +
facet_grid(. ~ Variable) +
geom_text(aes(label = University),
position = position_dodge(width = 0.9),
size = 3,
vjust = 0.5,
angle = 90,
hjust = -0.3)+
guides(fill = FALSE)
# Plot 5: Average revenue per student vs. average university ranking
lab_pos1 <- data.frame(University = c("Harvard University", "Princeton University", "Yale University",
"Columbia University", "University of Pennsylvania", "Brown University",
"Dartmouth College", "Cornell University"),
x = c(315000, 230000, 365000, 135000, 280000, 140000, 190000, 190000),
y = c(0, 7, 21, 25, 30, 50, 210, 26))
ggplot(insights_table, aes(x = AverageRevenuePerStudent, y = AverageOverallRanking, size = AverageOverallRanking, fill = University)) +
geom_point(aes(color = University), alpha = 0.7, show.legend = FALSE) +
labs(title = "Figure 5: Average Revenue Per Student vs. Average University Ranking",
x = "Average Revenue Per Student (USD)",
y = "Average University Ranking",
caption = "Bubble size = Ranking (large bubble = low ranking)") +
scale_x_continuous(labels = scales::comma) +
geom_text(data = lab_pos1, aes(x = x, y = y, label = University), size = 3)+
scale_size_continuous(range = c(10, 3)) + # Adjust the range for bubble sizes
theme(legend.position = "none")+
theme_minimal()+ theme(legend.position = "bottom", legend.title = element_blank())
The graphs we just plotted were informative and fun, but what about an interactive plot? Here we’ll make a map that shows every R1 and R2 university in the US, colored by public/private status, with the Ivy League institutions highlighted separately.
To do this, we’ll first need another simple SQL query that gives us each university’s name, coordinates, public/private status, and Ivy League flag.
We’ll use the tmap package and the tigris package to retrieve a shapefile of the US.
# Connect to database and use simple SQL query to get the information
db<-dbConnect(RSQLite::SQLite(), "demo_db.db")
university_data <- dbGetQuery(db, "SELECT
Name,
Coordinates,
Status,
IsIvy
FROM R1R2_uni_list
")
options(tigris_use_cache = TRUE) #follow the documentation on how to retrieve data from package
us_map <- tigris::states(class = "sf") # retrieve shapefile from data
# separate the latitude and longitude
processed_data <- university_data %>%
mutate(
Coordinates = sapply(strsplit(Coordinates, "\\s+"), function(x) paste(x[1], x[2], sep = ",")),
Latitude = sapply(strsplit(Coordinates, ","), `[`, 1),
Longitude = sapply(strsplit(Coordinates, ","), `[`, 2)
) %>%
select(-Coordinates)
# function which transforms DMS coordinates into decimals
dms_to_decimal <- function(coord) {
# Extract degrees, minutes, seconds, and direction using regex
parts <- strsplit(coord, "[^0-9.]+")[[1]]
# Convert parts to numeric values
degrees <- as.numeric(parts[1])
# parts contains only the numeric pieces (degrees, minutes, seconds), so
# minutes sit in position 2 and seconds in position 3 when present
minutes <- ifelse(length(parts) >= 2, as.numeric(parts[2]), 0)
seconds <- ifelse(length(parts) >= 3, as.numeric(parts[3]), 0)
# calculate decimal degrees
decimal <- degrees + minutes / 60 + seconds / 3600
return(decimal)
}
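# quick sanity check of the conversion (illustrative value, not from our data):
# dms_to_decimal("42°21′37″N") is 42 + 21/60 + 37/3600, i.e. roughly 42.36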
# load the library tmap
library(tmap)
# apply to data
processed_data$Latitude <- sapply(processed_data$Latitude, dms_to_decimal)
# multiply by -1 to have the negative longitude (West)
processed_data$Longitude <- sapply(processed_data$Longitude, dms_to_decimal) * (-1)
# Filter out rows with missing coordinates
coordinates_data <- processed_data[complete.cases(processed_data[, c("Longitude", "Latitude")]), ]
# Convert IsIvy to a factor with levels rather than a character object
coordinates_data$IsIvy <- factor(coordinates_data$IsIvy, levels = c("No", "Yes"))
library(sf)
# Convert coordinates_data to sf for compatibility
sf_coordinates_data <- st_as_sf(coordinates_data, coords = c("Longitude", "Latitude"), crs = 4326)
# create a first layer for the public and private universities
status_layer <- tm_shape(sf_coordinates_data %>% filter(IsIvy == "No")) +
tm_dots(
size = 0.1,
col = "Status",
palette = c("royalblue", "lightgreen"),
legend.show = FALSE
)
# create a second layer for the ivys
ivy_layer <- tm_shape(sf_coordinates_data %>% filter(IsIvy == "Yes")) +
tm_dots(size = 0.1,
col = "IsIvy",
palette = c("red"),
legend.show = FALSE)
# get map
tm_us <- tm_shape(us_map) +
tm_borders(alpha = 0.03)
# create the legend
legend <- tm_add_legend(type = "fill", col = c("lightgreen", "royalblue", "red"),
labels = c("Public", "Private (non-profit)", "Ivy"),
title = "University Type")
# create interactive map
map <- tm_us + status_layer + ivy_layer +
tm_layout(title = "Universities in the United States", legend.position = c("left", "bottom")) +
legend
# Display the map
tmap_mode("view")
map
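If you want to keep a standalone copy of the interactive map, tmap’s tmap_save() can write it out as an HTML file (the file name below is just an example):
# save the interactive map as a self-contained HTML file
tmap_save(map, "universities_map.html")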